Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization
A deeper understanding of video activities extends beyond recognition of
underlying concepts such as actions and objects: constructing deep semantic
representations requires reasoning about the semantic relationships among these
concepts, often beyond what is directly observed in the data. To this end, we
propose an energy minimization framework that leverages large-scale commonsense
knowledge bases, such as ConceptNet, to provide contextual cues to establish
semantic relationships among entities directly hypothesized from the video signal.
We mathematically express this using the language of Grenander's canonical
pattern generator theory. We show that the use of prior encoded commonsense
knowledge alleviates the need for large annotated training datasets and helps
tackle imbalance in the training data. Using three publicly available
datasets (Charades, Microsoft Visual Description Corpus, and Breakfast
Actions), we show that the proposed model can generate video
interpretations whose quality is better than those reported by state-of-the-art
approaches, which have substantial training needs. Through extensive
experiments, we show that the use of commonsense knowledge from ConceptNet
allows the proposed approach to handle various challenges such as training data
imbalance, weak features, and complex semantic relationships and visual scenes.
Comment: Accepted to WACV 201
Reducing Training Demands for 3D Gait Recognition with Deep Koopman Operator Constraints
Deep learning research has made many biometric recognition solutions viable,
but it requires vast training data to achieve real-world generalization. Unlike
other biometric traits, such as face and ear, gait samples cannot be easily
crawled from the web to form massive unconstrained datasets. As the human body
has been extensively studied for different digital applications, one can rely
on prior shape knowledge to overcome data scarcity. This work follows the
recent trend of fitting a 3D deformable body model into gait videos using deep
neural networks to obtain disentangled shape and pose representations for each
frame. To enforce temporal consistency in the network, we introduce a new
Linear Dynamical Systems (LDS) module and loss based on Koopman operator
theory, which provides an unsupervised motion regularization for the periodic
nature of gait, as well as a predictive capacity for extending gait sequences.
We compare LDS to the traditional adversarial training approach and use the USF
HumanID and CASIA-B datasets to show that LDS can obtain better accuracy with
less training data. Finally, we also show that our 3D modeling approach is much
better than other 3D gait approaches in overcoming viewpoint variation under
normal, bag-carrying and clothing change conditions.
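
The abstract above does not spell out the LDS module, so the following is only a minimal sketch of the general idea behind a Koopman-style linear dynamics constraint, not the authors' implementation. It assumes per-frame latent vectors stacked in an array `Z`, fits a single linear operator `K` by ordinary least squares, measures the one-step prediction error, and rolls the dynamics forward to extend the sequence. The function names, the least-squares fit, and the toy circular trajectory are all illustrative assumptions.

```python
import numpy as np


def fit_linear_operator(Z):
    """Least-squares fit of K such that Z[t + 1] ~= Z[t] @ K for a (T, d) sequence."""
    X, Y = Z[:-1], Z[1:]
    K, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return K


def lds_loss(Z, K):
    """Mean squared one-step prediction error under the linear dynamics K."""
    return float(np.mean((Z[:-1] @ K - Z[1:]) ** 2))


def extend_sequence(Z, K, steps):
    """Append `steps` frames predicted by repeatedly applying K to the last frame."""
    out = [Z]
    z = Z[-1]
    for _ in range(steps):
        z = z @ K
        out.append(z[None, :])
    return np.vstack(out)


# Toy periodic latent trajectory (hypothetical stand-in for gait features):
# a point moving around the unit circle with a period of 20 frames.
t = np.arange(40) * (2 * np.pi / 20)
Z = np.stack([np.cos(t), np.sin(t)], axis=1)

K = fit_linear_operator(Z)
loss = lds_loss(Z, K)
extended = extend_sequence(Z, K, 5)
print(loss, extended.shape)
```

Because a rotation is exactly linear, the fitted operator reproduces this toy trajectory almost perfectly and its eigenvalues lie on the unit circle, which is the kind of periodic behaviour a linear motion prior favours. In a trained network, a loss of this shape would presumably be computed on learned pose features rather than on a hand-built circle.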